Item difficulty & discrimination: Exploring the psychometric properties using R

This blog post explains the importance of examining item difficulty and discrimination in the development and validation of psychological measures and provides a step-by-step guide on how to calculate item difficulty and discrimination in R using an example dataset.

ShinyItemAnalysis
multilevel
Author
Affiliation

Ali Emre Karagül

TOBB ETU- University of Economics & Technology

Published

April 3, 2023


Introduction

Developing and validating psychological measures requires examining their psychometric properties, including reliability and validity. One such property is item difficulty, which refers to how easy or difficult each individual item is for respondents to answer correctly. Another important property is item discrimination, which measures the extent to which each item distinguishes between participants with high and low levels of the construct being measured.

Understanding item difficulty and item discrimination is crucial for several reasons. First, items that are too easy or too difficult can limit the variability of responses, making it harder to discriminate between participants who have different levels of the construct being measured. Second, items with low discrimination may not effectively differentiate between participants with different levels of the construct, leading to decreased validity.

In this blog post, we’ll explore how to calculate item difficulty and item discrimination in R using an example dataset. We’ll explain what each of these psychometric properties are, why they’re important, and how to interpret the results. We’ll also discuss some limitations and considerations when examining these properties. So let’s get started!

The dataset is generated via this web application based on the Item Response Theory 2-parameter logistic (2PL) model. It consists of 40 randomly generated items with a sample size of 500. The a-parameters of the items vary between 0.8 and 1.3, while the b-parameters vary between -3 and 3. The c and d parameters are fixed at 0 and 1, respectively, for all items.

Here are the package(s) we use in this post:
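We will use the multilevel and ShinyItemAnalysis packages (install them first if you haven't), plus knitr for printing tables:

```r
# install.packages(c("multilevel", "ShinyItemAnalysis"))  # run once if needed
library(multilevel)         # item.total() for corrected item-total correlations
library(ShinyItemAnalysis)  # gDiscrim(), ItemAnalysis(), DDplot()
```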

Item Difficulty

Item difficulty is a psychometric property that measures how easy or difficult an item is for respondents to answer correctly. Examining item difficulty is important because it can help identify items that are too easy or too difficult, which can limit the variability of responses and make it harder to discriminate between participants who have different levels of the construct being measured.

The proportion of correct responses for each item is calculated and reported as the item difficulty value. This calculation can be done manually in spreadsheet software or programmatically in statistical software such as R or SPSS. R has many packages and functions for calculating item difficulties, yet since each item is scored 0/1, the column mean already equals the proportion of correct responses, so we will simply use the colMeans() function.

Let’s start by loading the dataset into the R environment.

my_data<-read.csv("data for post about item difficulty and discrimination.csv",sep=";", header = TRUE)
head(my_data) 
  i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 i18 i19 i20 i21
1  0  1  1  1  0  0  1  0  0   1   1   0   1   1   1   1   0   0   0   0   0
2  0  1  1  1  0  0  1  0  0   1   0   0   1   0   1   1   1   1   0   1   0
3  0  1  1  1  0  0  1  1  1   1   1   0   1   1   1   1   1   1   1   1   0
4  1  1  0  0  0  0  0  0  0   1   0   0   1   1   0   1   1   0   0   0   0
5  0  1  1  1  0  0  0  0  1   1   1   1   1   1   1   1   1   1   0   0   0
6  0  1  1  1  1  1  1  1  1   1   1   1   1   1   0   1   1   0   0   0   1
  i22 i23 i24 i25 i26 i27 i28 i29 i30 i31 i32 i33 i34 i35 i36 i37 i38 i39 i40
1   0   0   1   1   1   0   1   0   0   0   1   0   1   0   1   0   1   1   0
2   0   0   1   1   1   0   0   0   0   0   0   1   1   0   1   0   0   1   1
3   0   0   1   1   1   0   1   0   0   0   1   1   1   1   1   0   1   1   1
4   0   0   0   1   1   0   1   0   0   0   0   0   1   0   0   0   0   1   1
5   0   0   1   1   1   1   0   0   0   0   1   1   1   1   1   0   0   1   1
6   0   0   1   1   1   0   1   1   0   0   1   1   1   1   1   1   1   1   1

Now let’s get the item difficulties:

# Calculate item difficulty
item_difficulty <- colMeans(my_data)
item_difficulty
   i1    i2    i3    i4    i5    i6    i7    i8    i9   i10   i11   i12   i13 
0.212 0.918 0.910 0.782 0.112 0.394 0.742 0.400 0.862 0.870 0.774 0.386 0.882 
  i14   i15   i16   i17   i18   i19   i20   i21   i22   i23   i24   i25   i26 
0.682 0.594 0.928 0.722 0.384 0.174 0.346 0.512 0.176 0.100 0.682 0.900 0.882 
  i27   i28   i29   i30   i31   i32   i33   i34   i35   i36   i37   i38   i39 
0.264 0.536 0.148 0.044 0.046 0.918 0.638 0.946 0.610 0.800 0.146 0.566 0.928 
  i40 
0.952 

Interpreting item difficulty is straightforward, with one caveat: despite its name, a higher difficulty value indicates an easier item, since the index is simply the proportion of correct responses. In our example dataset, item 40 had the highest difficulty value of 0.952, meaning that 95.2% of participants answered it correctly. Item 30, on the other hand, had the lowest difficulty value of 0.044, meaning that only 4.4% of participants answered it correctly.

It’s important to note that each construct should be evaluated in its own context when interpreting item difficulties. Still, for achievement tests, a generic classification might label an item “easy” if the index is 0.85 or above, “moderate” if it is between 0.41 and 0.84, and “hard” if it is 0.40 or below. Also, item difficulty is not the only factor to consider when evaluating the quality of a measure. Items that are too easy or too difficult may still be valid and reliable, depending on the construct being measured and the purpose of the measure. Nevertheless, examining item difficulty can provide valuable insights into the psychometric properties of the measure and inform decisions about item selection and revision.
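As a quick illustration, the generic cut-offs above can be applied with base R’s cut() function; the item values here are a hand-picked subset of the difficulties we just computed:

```r
# Rough guideline from above: hard <= 0.40 < moderate <= 0.84 < easy
p <- c(i30 = 0.044, i18 = 0.384, i15 = 0.594, i40 = 0.952)

difficulty_label <- cut(p,
                        breaks = c(0, 0.40, 0.84, 1),
                        labels = c("hard", "moderate", "easy"),
                        include.lowest = TRUE)
data.frame(item = names(p), difficulty = p, label = difficulty_label)
```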

Item Discrimination

Item discrimination is another important psychometric property: it measures how well each item differentiates between participants with high and low levels of the construct being measured.

It’s important to note that, just like item difficulties, each construct should be evaluated in its own context when interpreting item discriminations. Still, for achievement tests, a generic classification might label an index “good” if it is above 0.30, “fair” if it is between 0.10 and 0.30, and “poor” if it is below 0.10.

To obtain a value for item discrimination, there are several statistical approaches we can use. Here we will discuss (1) the correlation between each item and the total score including that item, (2) the correlation between each item and the total score excluding that item, and (3) the upper-lower groups index.

1. Correlation between item and total score with the item

This approach is based on calculating the point-biserial correlation coefficient (rpb) between each item and the total score of the measure. The total score is calculated by summing the scores of all items. The rpb ranges from -1 to 1, with values closer to 1 indicating higher discrimination.

Let’s first calculate the total score for each participant in our example dataset. Then, use a for loop to calculate rpb coefficients for each item:

#get the total score for each participant
total_score <- rowSums(my_data)

#pre-allocate a numeric vector for the 40 items in the test:
item_discrimination1 <- numeric(40)

#calculate rpb for each item:
for(i in 1:40){        
  item_discrimination1[i] <- cor(total_score, my_data[,i])  
}
round(item_discrimination1,4)
 [1] 0.3984 0.2162 0.2308 0.4485 0.3019 0.4847 0.4148 0.4548 0.2796 0.3335
[11] 0.4498 0.5138 0.2735 0.4336 0.4423 0.1956 0.4288 0.4133 0.3297 0.4214
[21] 0.4909 0.3174 0.3222 0.4098 0.3312 0.3464 0.3737 0.4876 0.2481 0.2257
[31] 0.1783 0.3536 0.4094 0.2837 0.3893 0.3888 0.3962 0.5025 0.2866 0.1906

2. Correlation between item and total score without the item

This approach is very similar to the first one; the only difference is that the item itself is excluded from the total score before computing the correlation. Excluding the item removes the inflation that comes from correlating an item with a total that contains it, so the resulting values are slightly lower than in the first approach. For that reason, it is usually the approach preferred by test developers (we would love to stay in the safe zone).
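Before reaching for a package, note that the computation itself is one line per item: subtract the item from the total, then correlate. A minimal sketch on toy data (the object names here are illustrative, not from the post’s dataset):

```r
set.seed(1)
# Toy 0/1 responses: 100 people x 5 items, just to show the idea
dat <- as.data.frame(matrix(rbinom(500, 1, 0.6), nrow = 100))

total <- rowSums(dat)
corrected_rit <- sapply(seq_along(dat), function(i) {
  cor(dat[[i]], total - dat[[i]])  # total score without item i
})
round(corrected_rit, 4)
```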

The multilevel package has a specific function for this index. The item.total() function takes the dataset as its only input and returns a dataframe with four columns: the item name, the discrimination index, a reliability estimate without the item, and the sample size. We only need the discrimination index column. Here’s how we get it:

item_discrimination2<-multilevel::item.total(my_data)
item_discrimination2$Item.Total
 [1] 0.3365024 0.1703597 0.1832279 0.3887597 0.2511883 0.4160659 0.3493047
 [8] 0.3837360 0.2232991 0.2803945 0.3892913 0.4477891 0.2207879 0.3648121
[15] 0.3701690 0.1521115 0.3624807 0.3399690 0.2694241 0.3503333 0.4211616
[22] 0.2564267 0.2746220 0.3395723 0.2838892 0.2959223 0.3053936 0.4177549
[29] 0.1892085 0.1916937 0.1429237 0.3110586 0.3368093 0.2471547 0.3143658
[36] 0.3277659 0.3428951 0.4342855 0.2447900 0.1546114

3. Upper-lower groups index

Personally, I find this approach the most intuitively connected to the idea of “discrimination”. To calculate it, we divide the whole sample into sub-groups (usually three) according to their total scores, then compute the index for an item by comparing these groups’ responses to that item. This definition simply sounds more like a discrimination index to my ears.
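To make the idea concrete, here is a hand-rolled version on toy data (three equal-sized groups; the gDiscrim() function discussed below handles the details and edge cases properly):

```r
set.seed(2)
# Toy 0/1 responses: 90 people x 4 items
dat <- as.data.frame(matrix(rbinom(360, 1, 0.5), nrow = 90))
total <- rowSums(dat)

# Split respondents into 3 equal-sized groups by total score
grp <- cut(rank(total, ties.method = "first"),
           breaks = 3, labels = c("lower", "middle", "upper"))

# ULI: proportion correct in the upper group minus the lower group
uli <- sapply(dat, function(item) {
  mean(item[grp == "upper"]) - mean(item[grp == "lower"])
})
round(uli, 4)
```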

In the R environment, the ShinyItemAnalysis package has a specific function for this index. The gDiscrim() function has several arguments, such as Data (the dataset), k (the number of sub-groups, default 3), and l and u (numeric values defining the lower and upper groups, with defaults 1 and 3, respectively). There are other arguments that should be checked in the manual before using the function.

We simply use the function as:

item_discrimination3<-ShinyItemAnalysis::gDiscrim(my_data)
item_discrimination3
        i1         i2         i3         i4         i5         i6         i7 
0.35177700 0.11253051 0.12801251 0.42537370 0.20042709 0.57908786 0.40470561 
        i8         i9        i10        i11        i12        i13        i14 
0.56204240 0.19718578 0.26586333 0.43982611 0.56566504 0.21888347 0.50640635 
       i15        i16        i17        i18        i19        i20        i21 
0.49603417 0.11253051 0.45530811 0.44428768 0.26346095 0.46442190 0.56829622 
       i22        i23        i24        i25        i26        i27        i28 
0.26449054 0.20145668 0.43875839 0.20908328 0.25709274 0.36420836 0.59567572 
       i29        i30        i31        i32        i33        i34        i35 
0.17720409 0.08419768 0.07748627 0.19100824 0.45732916 0.12957596 0.41599298 
       i36        i37        i38        i39        i40 
0.37633466 0.29804759 0.54766626 0.13163514 0.09601891 

Note that if you change the number of subgroups, the resulting index values change as well, so be careful when interpreting or comparing the results.
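For example, using the arguments described above, we could compare the top fifth with the bottom fifth instead of thirds; the indexes will typically come out larger because the compared groups are more extreme:

```r
# Compare the top fifth with the bottom fifth (k = 5 groups)
ShinyItemAnalysis::gDiscrim(my_data, k = 5, l = 1, u = 5)
```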

No matter what statistical approach you use to estimate discrimination indexes, it’s also important to note that item discrimination can be influenced by factors such as the sample size, the range of scores, and the homogeneity of the sample. Therefore, it’s recommended to examine item discrimination in conjunction with other psychometric properties such as item difficulty and reliability.

A Package of Personal Preference

While we can calculate both difficulty and discrimination indexes manually or with different functions from different packages, my totally-subjective opinion is that the ItemAnalysis() function from the ShinyItemAnalysis package gives a well-groomed output covering many item statistics. The following code snippet provides us with many indexes, including difficulty and the three types of discrimination indexes discussed above:

#round is for rounding the values in the results.
item_stats<-round(ShinyItemAnalysis::ItemAnalysis(my_data),2)
#to see all the output in a table:
knitr::kable(item_stats)
Difficulty Mean SD Cut.score obs.min Min.score obs.max Max.score Prop.max.score RIR RIT Corr.criterion ULI gULI Alpha.drop Index.rel Index.val Perc.miss Perc.nr
i1 0.21 0.21 0.41 NA 0 0 1 1 0.21 0.34 0.40 NA 0.35 NA 0.83 0.16 NA 0 0
i2 0.92 0.92 0.27 NA 0 0 1 1 0.92 0.17 0.22 NA 0.11 NA 0.84 0.06 NA 0 0
i3 0.91 0.91 0.29 NA 0 0 1 1 0.91 0.18 0.23 NA 0.13 NA 0.84 0.07 NA 0 0
i4 0.78 0.78 0.41 NA 0 0 1 1 0.78 0.39 0.45 NA 0.43 NA 0.83 0.19 NA 0 0
i5 0.11 0.11 0.32 NA 0 0 1 1 0.11 0.25 0.30 NA 0.20 NA 0.84 0.10 NA 0 0
i6 0.39 0.39 0.49 NA 0 0 1 1 0.39 0.42 0.48 NA 0.58 NA 0.83 0.24 NA 0 0
i7 0.74 0.74 0.44 NA 0 0 1 1 0.74 0.35 0.41 NA 0.40 NA 0.83 0.18 NA 0 0
i8 0.40 0.40 0.49 NA 0 0 1 1 0.40 0.38 0.45 NA 0.56 NA 0.83 0.22 NA 0 0
i9 0.86 0.86 0.35 NA 0 0 1 1 0.86 0.22 0.28 NA 0.20 NA 0.84 0.10 NA 0 0
i10 0.87 0.87 0.34 NA 0 0 1 1 0.87 0.28 0.33 NA 0.27 NA 0.83 0.11 NA 0 0
i11 0.77 0.77 0.42 NA 0 0 1 1 0.77 0.39 0.45 NA 0.44 NA 0.83 0.19 NA 0 0
i12 0.39 0.39 0.49 NA 0 0 1 1 0.39 0.45 0.51 NA 0.57 NA 0.83 0.25 NA 0 0
i13 0.88 0.88 0.32 NA 0 0 1 1 0.88 0.22 0.27 NA 0.22 NA 0.84 0.09 NA 0 0
i14 0.68 0.68 0.47 NA 0 0 1 1 0.68 0.36 0.43 NA 0.51 NA 0.83 0.20 NA 0 0
i15 0.59 0.59 0.49 NA 0 0 1 1 0.59 0.37 0.44 NA 0.50 NA 0.83 0.22 NA 0 0
i16 0.93 0.93 0.26 NA 0 0 1 1 0.93 0.15 0.20 NA 0.11 NA 0.84 0.05 NA 0 0
i17 0.72 0.72 0.45 NA 0 0 1 1 0.72 0.36 0.43 NA 0.46 NA 0.83 0.19 NA 0 0
i18 0.38 0.38 0.49 NA 0 0 1 1 0.38 0.34 0.41 NA 0.44 NA 0.83 0.20 NA 0 0
i19 0.17 0.17 0.38 NA 0 0 1 1 0.17 0.27 0.33 NA 0.26 NA 0.84 0.13 NA 0 0
i20 0.35 0.35 0.48 NA 0 0 1 1 0.35 0.35 0.42 NA 0.46 NA 0.83 0.20 NA 0 0
i21 0.51 0.51 0.50 NA 0 0 1 1 0.51 0.42 0.49 NA 0.57 NA 0.83 0.25 NA 0 0
i22 0.18 0.18 0.38 NA 0 0 1 1 0.18 0.26 0.32 NA 0.26 NA 0.84 0.12 NA 0 0
i23 0.10 0.10 0.30 NA 0 0 1 1 0.10 0.27 0.32 NA 0.20 NA 0.84 0.10 NA 0 0
i24 0.68 0.68 0.47 NA 0 0 1 1 0.68 0.34 0.41 NA 0.44 NA 0.83 0.19 NA 0 0
i25 0.90 0.90 0.30 NA 0 0 1 1 0.90 0.28 0.33 NA 0.21 NA 0.83 0.10 NA 0 0
i26 0.88 0.88 0.32 NA 0 0 1 1 0.88 0.30 0.35 NA 0.26 NA 0.83 0.11 NA 0 0
i27 0.26 0.26 0.44 NA 0 0 1 1 0.26 0.31 0.37 NA 0.36 NA 0.83 0.16 NA 0 0
i28 0.54 0.54 0.50 NA 0 0 1 1 0.54 0.42 0.49 NA 0.60 NA 0.83 0.24 NA 0 0
i29 0.15 0.15 0.36 NA 0 0 1 1 0.15 0.19 0.25 NA 0.18 NA 0.84 0.09 NA 0 0
i30 0.04 0.04 0.21 NA 0 0 1 1 0.04 0.19 0.23 NA 0.08 NA 0.84 0.05 NA 0 0
i31 0.05 0.05 0.21 NA 0 0 1 1 0.05 0.14 0.18 NA 0.08 NA 0.84 0.04 NA 0 0
i32 0.92 0.92 0.27 NA 0 0 1 1 0.92 0.31 0.35 NA 0.19 NA 0.83 0.10 NA 0 0
i33 0.64 0.64 0.48 NA 0 0 1 1 0.64 0.34 0.41 NA 0.46 NA 0.83 0.20 NA 0 0
i34 0.95 0.95 0.23 NA 0 0 1 1 0.95 0.25 0.28 NA 0.13 NA 0.84 0.06 NA 0 0
i35 0.61 0.61 0.49 NA 0 0 1 1 0.61 0.31 0.39 NA 0.42 NA 0.83 0.19 NA 0 0
i36 0.80 0.80 0.40 NA 0 0 1 1 0.80 0.33 0.39 NA 0.38 NA 0.83 0.16 NA 0 0
i37 0.15 0.15 0.35 NA 0 0 1 1 0.15 0.34 0.40 NA 0.30 NA 0.83 0.14 NA 0 0
i38 0.57 0.57 0.50 NA 0 0 1 1 0.57 0.43 0.50 NA 0.55 NA 0.83 0.25 NA 0 0
i39 0.93 0.93 0.26 NA 0 0 1 1 0.93 0.24 0.29 NA 0.13 NA 0.84 0.07 NA 0 0
i40 0.95 0.95 0.21 NA 0 0 1 1 0.95 0.15 0.19 NA 0.10 NA 0.84 0.04 NA 0 0

The same package also provides a nice visualization function, DDplot(), for item difficulty together with the discrimination index of any approach stated above. Its discrim argument can be set to RIT (the first approach), RIR (the second approach), or ULI (the third approach). You can also define a threshold value to draw a line on the plot via the thr argument. Here is a sample usage for our case:

DDplot(my_data, discrim = 'ULI', k = 3, l = 1, u = 3, thr=0.1)

It can be seen that items 30, 31 and 40 from our simulated dataset fall below our 0.1 threshold in terms of discrimination. Interestingly, items 30 and 31 are the most difficult items while item 40 is the easiest one. This is no coincidence: when almost everyone answers an item the same way, there is little response variance left for the item to discriminate with. What a weirdo… :D

Conclusion

Examining item difficulty and item discrimination is an important part of evaluating the psychometric properties of a measure. Item difficulty measures how easy or difficult each individual item is for respondents to answer correctly, while item discrimination measures the extent to which each item differentiates between participants with high and low levels of the construct being measured.

In this blog post, we explored how to calculate item difficulty and item discrimination in R using an example dataset. We explained what each of these psychometric properties are, why they’re important, and how to interpret the results. We also discussed some limitations and considerations when examining these properties.

Overall, understanding and evaluating the psychometric properties of a measure can help ensure its reliability and validity, and inform decisions about item selection and revision.